This is an exploratory data analysis of the relationships between diabetes, obesity, leisure time inactivity, and per capita income in US counties. To see how the analysis was done, see the file code.html.
The US county income data is from the Wikipedia page List of United States counties by per capita income. The diabetes dataset is from the CDC Diabetes page County Data Indicators. It contains statistics on diabetes, obesity, and leisure time inactivity in US counties from 2004 to 2013. Only the data from 2013 was used here. Both datasets were downloaded on 2/3/2017.
The county names in both datasets were cleaned and made compatible so that the datasets could be joined.
To examine the possible relationship between diabetes prevalence and per capita income in different US counties, we join the two datasets on state and county.
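The join on state and county can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the analysis (that is in code.html). The rows, the column names (`per_capita_income`, `diabetes_pct`), and the helper `inner_join` are all hypothetical. The sketch also shows why the state has to be part of the key: the same county name can occur in more than one state.

```python
# Hypothetical rows with made-up values; the real data has one row per county.
income_rows = [
    {"state": "Ohio", "county": "Washington", "per_capita_income": 21000},
    {"state": "Utah", "county": "Washington", "per_capita_income": 22000},
    {"state": "Ohio", "county": "Adams", "per_capita_income": 19000},
]
diabetes_rows = [
    {"state": "Ohio", "county": "Washington", "diabetes_pct": 12.0},
    {"state": "Utah", "county": "Washington", "diabetes_pct": 8.0},
]


def inner_join(left, right, keys=("state", "county")):
    """Inner join two lists of dicts on the given key columns."""
    # Index the right-hand rows by their (state, county) tuple.
    index = {tuple(row[k] for k in keys): row for row in right}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:  # keep only counties present in both datasets
            joined.append({**row, **match})
    return joined


joined = inner_join(income_rows, diabetes_rows)
```

Counties that appear in only one dataset (here the hypothetical Adams row) are dropped, which is why the county names have to match exactly before joining.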
We plot the data as a scatterplot with one point per county. The color of each point is based on the county's population size. A linear regression line is also shown.
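For readers who want to see what the regression line represents, here is a minimal sketch of an ordinary least-squares fit of one variable on another. It assumes a simple one-predictor fit; the function name `fit_line` and the numbers are made up for illustration and are not taken from the datasets.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up values: per capita income (USD) and diabetes prevalence (%).
income = [20000, 25000, 30000, 35000]
diabetes = [13.0, 12.0, 11.0, 10.0]
slope, intercept = fit_line(income, diabetes)
```

A negative slope corresponds to the kind of pattern discussed here: prevalence falling as income rises.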
Diabetes prevalence appears to be higher in counties with lower per capita income.
Similarly, we join the obesity data with the income data and plot the result.
Here too, obesity prevalence appears to be higher in counties with lower per capita income.
Next, we join the leisure time inactivity data with the income data and plot the result.
The pattern is similar to the cases above: leisure time inactivity appears to be more prevalent in counties with lower per capita income.
It is also interesting to see whether there is a relationship between obesity and diabetes. To find out, we join the obesity and diabetes data and plot the result.
Higher prevalence of obesity seems to be related to higher prevalence of diabetes.
Similarly, we examine the relationship between leisure time inactivity and diabetes.
The trend is similar to the one seen above for obesity vs. diabetes.
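One way to put a number on a relationship like this is the Pearson correlation coefficient. The sketch below is only an illustration of the calculation; it was not part of the original analysis, the function name `pearson_r` is hypothetical, and the values are made up.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up prevalence values (%), one pair per hypothetical county.
obesity = [25.0, 28.0, 31.0, 35.0]
diabetes = [8.5, 9.0, 11.0, 12.5]
r = pearson_r(obesity, diabetes)
```

Values of `r` close to 1 indicate a strong positive linear relationship, values close to -1 a strong negative one, and values near 0 little linear relationship.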
Finally, we examine the relationship between leisure time inactivity and obesity.
Here too, the trend is similar. Higher prevalence of leisure time inactivity seems to be related to higher prevalence of obesity.
The different forms of the county names caused a lot of manual work. There were systematic differences, such as the omission of the word ‘County’ in the Wikipedia data, and non-systematic differences, such as ‘De Witt’ vs. ‘DeWitt’.
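Both kinds of difference mentioned above can be partly handled by reducing each name to a normalized join key. The following is a sketch under stated assumptions, not the cleaning that was actually done: the function name `normalize_county` is hypothetical, and it only handles a trailing ‘County’ plus spacing, punctuation, and letter case.

```python
import re


def normalize_county(name):
    """Reduce a county name to a key that tolerates common spelling variants."""
    name = name.strip()
    # Systematic difference: drop a trailing "County" if present.
    name = re.sub(r"\s+county$", "", name, flags=re.IGNORECASE)
    # Non-systematic differences: ignore spaces, periods, apostrophes,
    # hyphens, and letter case (so "De Witt" and "DeWitt" give the same key).
    return re.sub(r"[\s.'-]", "", name).lower()


key_a = normalize_county("De Witt County")
key_b = normalize_county("DeWitt")
```

Such a key should be used only for joining, with the original name kept for display. Aggressive normalization can in principle merge two names that are really different within the same state, so unmatched and duplicate keys are still worth checking by hand.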
The plotly package was not entirely easy to use. One particularly strange behavior showed up when I added the linear regression lines to the plots. For some reason, plotly also added marker points to the regression line (projecting the x-values of the original points onto the line). These markers completely hid the actual line. It did not matter whether I set mode="lines" or mode="lines+markers". Finally I tried mode="lines-markers", and it removed the markers! However, I did not find this value in the documentation.
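For reference, the documented values of a scatter trace's mode are the flags "lines", "markers", and "text", joined with "+", or "none". The sketch below shows one way to keep the points and the regression line apart: each gets its own trace, and the line is defined by only its two endpoints. It builds the figure as a plain dictionary in Plotly's JSON figure format, so it runs without plotly installed. The function name `regression_figure`, the trace names, and the axis titles are hypothetical, and whether this avoids the behavior described above in the setup used for this analysis is an assumption.

```python
import json


def regression_figure(xs, ys, slope, intercept):
    """Build a Plotly-style figure dict: county points plus a fitted line."""
    # Two endpoints are enough to draw a straight line, so even if markers
    # were drawn on this trace, there would be only two of them.
    x_ends = [min(xs), max(xs)]
    y_ends = [slope * x + intercept for x in x_ends]
    return {
        "data": [
            {"type": "scatter", "mode": "markers",
             "x": list(xs), "y": list(ys), "name": "counties"},
            {"type": "scatter", "mode": "lines",
             "x": x_ends, "y": y_ends, "name": "linear fit"},
        ],
        "layout": {
            "xaxis": {"title": {"text": "Per capita income"}},
            "yaxis": {"title": {"text": "Diabetes prevalence (%)"}},
        },
    }


# Made-up values and a made-up fit, for illustration only.
fig = regression_figure([20000, 30000, 25000], [13.0, 11.0, 12.5],
                        slope=-0.0002, intercept=17.0)
fig_json = json.dumps(fig)  # can be handed to any Plotly front end
```

With the Python plotly package installed, a dictionary like this can be passed to `plotly.graph_objects.Figure` to display it.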